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ABSTRACT 

The aim of the new generation of radio synthesis arrays such as LOFAR and SKA is to achieve 
much higher sensitivity, resolution and frequency coverage than what is available now, espe- 
cially at low frequencies. To accomplish this goal, the accuracy of the calibration techniques 
used is of considerable importance. Moreover, since these telescopes produce huge amounts 
of data, speed of convergence of calibration is a major bottleneck. The errors in calibration 
are due to system noise (sky and instrumental) as well as the estimation errors introduced by 
the calibration technique itself, which we call "solver noise". We define solver noise as the 
"distance" between the optimal solution (the true value of the unknowns, uncorrupted by the 
system noise) and the solution obtained by calibration. We present the Space Alternating Gen- 
eralized Expectation Maximization (SAGE) calibration technique, which is a modification of 
the Expectation Maximization algorithm, and compare its performance with the traditional 
Least Squares calibration based on the level of solver noise introduced by each technique. 
For this purpose, we develop statistical methods that use the calibrated solutions to estimate 
the level of solver noise. The SAGE calibration algorithm yields very promising results both 
in terms of accuracy and speed of convergence. The comparison approaches that we adopt 
introduce a new framework for assessing the performance of different calibration schemes. 

Key words: methods: statistical, methods; numerical, techniques: interferometric 



1 INTRODUCTION 

Early radio-astronomy predominantly used single-dishes for ob- 
servations. With the resolution requirements increasing, the sin- 
gle dish approach became impractical. This paved the path for us- 
ing radio-interferometric techniques with multiple antennas linked 
together as an array that operates as a large effective single-dish 
( [Thompson et al.|2001[ ). 

The sensitivity of an interferometer is greatly increased, com- 
pared to a single-dish telescope, due to the larger combined col- 
lecting area. The currently planned or built radio interferome- 
ters, such as the Square Kilometer Array (SKA|^ the Murchison 
Widefield Array (MWaQ the Precision Array to Probe Epoch of 
Reionization (PAPER|5 the 21 -cm Array (21CMAf] the Hydro- 
gen Epoch of Reionization Array (HERaFI the Long Wavelength 
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Array (LWA^and the LOw Frequency ARray (LOFArQ consist 
of a large number of elements and include short, intermediate and 
many of them longer antenna spacings. For an introduction to radio 
interferometry we refer the reader to |Thompson et al!| ( |2001^ . 

In the interferometric visibilities there always exist errors in- 
troduced by the sky, the atmosphere (e.g. troposphere and iono- 
sphere), the instrument (e.g. beam-shape, frequency response, re- 
ceiver gains etc.) and by Radio Frequency Interference (RFI). The 
process of estimating and reducing the errors in these measure- 
ments is called "calibration" and is an essential step before imaging 
the visibilities. 

The classical calibration method, named external (or primary) 
calibration, is based on observing a celestial radio source with 
known properties. This approach is strongly dependent on the ac- 
curacy with which the source properties are known. The external 
calibration is improved by using self-calibration ( [Pearson & Read-| 
[head[1984) l which utilizes the observed data for estimating both the 
unknown instrumental and the sky parameters. The quality of cal- 
ibration and the imaging is significantly increased by iterating be- 
tween the sky and the instrument model. The redundant calibration 

^ http://lwa.unm.edu 
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is also independent of the sky model. It calibrates for both the sky 
and the instrument, using redundant information in the measured 
data. However, its performance is limited to arrays with a regular 
arrangement in their antennas layout. 

Calibration is an optimization process that is non-linear by na- 
ture. It is in essence a Maximum Likelihood (ML) estimation of 
the unknown parameters by applying non-linear optimization tech- 
niques. Traditional calibration is estimating the ML solution by the 
non-linear Least Squares (LS) method via various gradient-based 
techniques such as the Levenberg-Marquardt (LM) algorithm ( |Lev-| 
|enberg|[l944[ |Marquardt||I963| l. This approach was improved by 
the Expectation Maximization (EM) algorithm ( Fed er & Weinsteiii] 
|I988[ l and later on by the Space Alternating Generalized Expec- 
tation Maximization (SAGE) technique which was introduced by 
[Fessler & Herol(T994) and was applied to interferometer calibra- 
tion by'Yatawatta et al.| ( [2009) . The analysis and application of the 
aforementioned schemes for the calibration of radio interferome- 
ters can be found in |Yatawatta et al.| ( |2009^ . 

To reach the scientific goals of the new generation of radio 
arrays, calibration algorithms must have the highest accuracy pos- 
sible. Furthermore, the number of measured visibilities that has to 
be calibrated is unprecedented. The speed of convergence of the 
calibration processes must therefore be the fastest with the mini- 
mum possible computational cost. Based on these facts, the best 
calibration method is referred to as the one which minimizes the 
"distance" between the true values of unknown parameters and the 
values obtained by calibration and minimizes computational time. 

We should take into account that the measured data of an inter- 
ferometer is always corrupted by different sources of noise such as 
the thermal noise, which is an additive Gaussian random process, 
and confusion noise (Condon 1974), which affects the coherency 
matrix (see next section and e.g. ,Bom & Wolf| l [I999[ >). For a de- 
tailed discussion on the sources of noise the reader is referred to 
the Chapter 6 of ' Wijnholds| ^2010| l. When the calibration process 
of the measured data is done, the "distance" between the true value 
of the unknown parameters and their calibrated solutions depends 
on the initial noise and the errors originating from the calibration 
process itself (e.g. converging to a local minimum), which is called 
"solver noise". In other words, because the calibrated solutions are 
not optimal, there always exists some solver noise between these 
solutions and the true values of the unknown parameters affected 
by the initial noise. The lower the solver noise, the higher the ac- 
curacy of the calibrated results. Thus, in order to increase the cali- 
bration efficiency, we need to choose the calibration scheme which 
has the minimum solver noise as well as the lowest computational 
cost. To achieve this, we should be able to compare these two fac- 
tors between various calibration techniques. We introduce a general 
framework for detecting the level of solver noise in calibration al- 
gorithms based only on their solutions. 

In this paper, we present the SAGE calibration method and 
emphasize its superiority, compared to the traditional LS calibra- 
tion, in terms of accuracy and speed of convergence. Mathemati- 
cal derivations of the algorithms are presented in the appendices. 
We also investigate the applicability of two well-known measures, 
the KuUback-Leibler Diverge nce (KLD) jKullback|1997) and the 
Likelihood Ratio Test (LRT) l |Graves|1978 1, in revealing the level 
of solver noise in calibrated solutions. Illustrative examples of both 
real and simulated observations, show the superior performance of 
the SAGE calibration compared to the LS one. They also indicate 
that the LRT approach is very promising at detecting the level of 
solver noise in the obtained calibrated solutions, while the KLD 
approach is not always conclusive. 



The following notations are used in this paper: Bold, lower- 
case letters refer to column vectors, e.g., y. Upper case bold letters 
refer to matrices, e.g., C. All parameters are complex numbers, un- 
less stated otherwise. The inverse, transpose, Hermitian transpose, 
and conjugation of a matrix are presented by (.)^^, {-Y^ , (O^' 
(.)*, respectively. The statistical expectation operator is referred 
to as E{.}. The matrix Kronecker product and the proper (strict) 
subset are denoted by ® and C, respectively. The diagonal matrix 
consisting of only the diagonal entries of a square matrix is given 
by diag(.). I is the identity matrix and is the empty set. The Kro- 
necker delta function is presented by 5ij . R and C are the sets of 
Real and Complex numbers, respectively. The Frobenius norm is 
shown by 1 1 . 1 1 . Estimated parameters are denoted by a hat, ( . ) . All 
logarithmic calculations are to the base e. 



2 THE MEASUREMENT EQUATION 

The first stage in the (self-)calibration process is to provide an ef- 
ficient measurement equation which relates the visibilities with the 
unknown sky and the instrument parameters. In this section, we 
use the measurement equation presented by |Hamaker et al.| ( |1996[ l. 
For assessing the equation from the array signal processing point 
of view, the reader is referred to iLeshem & van der Veen|2000l 
IBoonstra & van der Veen|2003||van der Tol et al.|2007| l. 

We assume that we have a radio interferometer consisting of 
A*' receiver antennas. Each antenna consists of two orthogonal dual- 
polarization feeds, which receive the incident polarized waves from 
astrophysical sources in the sky. We also assume that the radio 
frequency sky consists of K discrete, uncorrelated sources. The 
sources are far enough from the array that their radiations can be 
assumed to be plane waves. 



Let us consider = [exi &Yi\ represents the electric field 
vector of the i-th source. This field causes an induced voltage Vpi = 
[vxpi VY-pi]^ at antennap for every p G {1,2,..., A'^} due to: 

JpjG^. (1) 

In Eq. l[TJ, the 2x2 Jones matrix Jpi describes the complex inter- 
action between the fields, the antenna beam-shape and ionosphere, 
as well as the remaining signal path. The total signal at the antenna 
p, Vp, is a linear superposition of K such signals as in (jTJ. At the 
end, the receiver noise v = [vx i'y]'^ also is added to this signal. 

Before correlating the voltages of the interferometer antennas, 
each voltage is corrected for a geometric delay depending on the 
location of its receiver antenna on the earth. Thereafter, the p-th 
antenna voltage gets correlated to the other A'^ — 1 array antenna 
voltages in the array correlator. The correlated voltages, referred to 
as visibilities ( [Hamaker et al.|1996) of the baseline pq correspond- 
ing to the p-th and the q-lh antennas, E{vp v^}, can be given 
as 

K 

Vp, = ;^Jp,WC,J,'^W + Np„ p,qe{l,2,...,N}. (2) 



In Eq. |2|, the Jones matrices (Hamaker et al. 1996), 3pi{6) and 
Jqi{d) describe the electromagnetic interaction of the source i at 
antennas p and q, respectively ( |Bom & Wolf|1999) . In particular, 
the instrumental properties (the beam shape, low-noise amplifier 
gain, system frequency response, etc.) as well as the propagation 
properties (tropospheric and ionospheric distortions, etc.) are rep- 
resented by the Jones matrix formalism. They can be considered as 
the direction-dependent gains of the corresponding antennas for the 
i-th source. The unknown parameter vector G contains the 
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parameters of both the instrument and the sky model. Npq is the 
2x2 noise matrix of the baseUne pq. The coherency matrix ( |Bom| 
|& Wolf|1999l|Hamaker et al.|1996| > is defined as 



C, = E{e, ®ef}, 



(3) 



which provides us information about the polarization state of the ra- 
diation of the i-th source. We assume that an initial estimate of the 
coherency matrix C,; is known based on some prior information, 
about the source or sky properties, obtained by previous observa- 
tions. 

Calibration is essentially a step to estimate the elements of 
the unknown parameter vector d, i.e., P complex values or 2P real 
values. The parameter vector 6 is a random variable, that varies as a 
function of time and frequency, by nature. We assume without loss 
of generality that we are calibrating on a small enough time and 
frequency interval during which the variation of 6 is negligible. We 
then split the integration time to several sub-intervals and apply the 
calibration process to all of them separately. 

Finally, the vectorized form of Eq. l|2| can also be written as 



Vpq = vec(Vpq) = ^ J*,(^) ® Jpi(fl)vec(Ci) + rip 



(4) 



where n„ 



vec(Np5). Ignoring the auto-correlations for 



r T T 
[Vi2 Vi3 

r T T 
[ni2 ni3 . 



g, and stacking up all cross correlations as y 
]"^, and all noise vectors as n 



^(JV-l)JV 

surement equation as 



''(]V-l)]V 

nov_1^wl"^, we obtain the general form of the mea- 



y = ^s,(e) + n. 



(5) 



In Eq. (jsj, y, n G C*^,where M is at most 2A''(iV - 1) providing 
all the cross-correlations. The dimension of the parameter vector 0, 
P, is a multiple of KN. Thus, for a large enough N and a small 
enough K, there are enough constraints for estimating 0. However, 
when the number of sources in the sky is uncertain, the optimal 
K could be selected using Aikaike's Information Criterion (AIC, 
|Akaike| ( (l973[ l) as presented in |Yatawatta et al.H2009[ l. The nonlin- 
ear function Si{0), defined for i € {1, 2, ... , K} as 



s.(6l) 



J|,((9)® Ji.((9)vec(C0 
J^,((9)® Ji,((9)vec(C0 



J mid)' 



• (JV-l)i 



(e)vec(CO 



corresponds to the contribution of the i-th source in the observa- 
tion. The noise n is assumed to have a multivariate Gaussian dis- 
tribution with zero mean and M x M covariance matrix H, i.e., 
n ~ A/'(0,n) I Yatawatta et al.|[2009 1. Having the measurement 



equation in hand, one can apply different calibration techniques for 
estimating the ML of the unknowns. 



3 THE LS, EM, AND THE SAGE CALIBRATION 
ALGORITHMS 

In this section, we briefly discuss the Least Squares (LS, Nor- 
mal) calibration via the Levenberg-Marquardt (LM) algorithm. We 
also discuss the new robust calibration techniques, the Expectation 
Maximization (EM) and in particular the Space Alternating Gener- 
alized Expectation Maximization (SAGE) calibration algorithms. 



3.1 The LS calibration via LM algorithm 

LS calibration considers the additive noise A/^ to be a white Gaus- 
sian noise. Because the measurement equation shown in Eq. jSl has 
the general form of a non-linear regression model ( |Gallant[l975[ 
[Bates & Watts|2007] >, the likelihood of the unknown parameter is 
maximized when the sum of squared residuals is minimized. Thus, 
the ML estimation of will be equal to the below least squared 
error estimation 







arg mm 




Iy-Ef=is.wi|= 



(6) 



It is equivalent to minimize the distance between the observed vis- 
ibilities in y, and the predicted interferometer response as a super- 
position of K non-linear functions Si{0) for i G {1,2,..., K}. 
However, solving Eq. ^ suffers the same set of problems faced by 
any non-linear optimization problem, such as convergence to a lo- 
cal minimum, having a slow speed of convergence and significant 
computational cost. 

There are variousgradient-based optimization algorithms for esti- 
mating at Eq. |6l. The iterative LM algorithm is one of the most 
robust gradient-based optimization techniques in the sense that 
most of the time, given suitable initial suggestion 0'\ it converges 
to a global optimum. Considering (p{0) = ||y — X^iLi 
the cost function, the estimation of at the fc + 1-th iteration of the 
algorithm will be obtained by 



(7) 



In Eq. (|7jl, Ve is the gradient with respect to 0, and A is the 
damping factor which should be adjusted at each iteration (?). The 
matrix H — diag(VeVj<;^(^)) is the diagonal of the Hessian ma- 
trix. 

The EM algorithm, and in particular the SAGE algorithm improve 
the accuracy and computational cost compared with the LS cali- 
bration. Since they break the ML estimation problem into smaller 
problems, the computational cost will be decreased by an order of 
magnitude and the rate of convergence is substantially increased. 



3.2 The EM calibration algorithm 



In order to apply the EM algorithm ( [Feder & Weinstein|1988| l in 
calibration, we first need to extract a complete data space x from 
the observed data y. Similar to |Yatawatta et al.|p009| l, we consider 
the complete data space as x = [xi X2 . . . x^f ] in which Xi has 
the definition 



Xi = Si{0i) + n,, for i e {1,2,... K}. 



(8) 



In fact, Eq. ^ assumes that the contribution of the i-th source 
in the observation depends only on a subset of parameters, 0i. So, 
we partition the unknowns over the parameter vector correspond- 
ing to all the K sources in the sky as = [0i 0^ . . . 0'k]'^ . This 
partitioning is justifiable as each source is at a unique direction on 
the sky, and for each antenna, the signal path of all sources is the 
same. This is the case for our initial assumption where the sources 
are separated sufficiently. Also, the total noise is arbitrary decom- 
posed into K components, n,; for i G {1,2,..., A"}, such that 



(9) 



As the most convenient assumption, we let the noise components 
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riiS to follow statistically independent zero mean Gaussian distri- 
butions with the covariance matrix 

E{n,nf } = P-Ajn, (10) 

where 

K 

A £[0,1], forzG{l,2,...,7f}, = (11) 

j=i 

In principle, we can associate stronger sources with higher noise, 
hence higher /3iS, and vice-versa. 

Combining Eq. |5]( and Eq. l[8](, the observed data y will be derived 
from 

K 

1=1 

Therefore, for the given complete data space x we have 

y= [II...I]x = Gx, (13) 

where G is a block matrix containing the identity matrix I for K 
times. 

Having the definitions of complete and observed data in hand, the 
EM algorithm can be used to estimate the ML of the parameter 
vector 0. Applying the EM method for the new form of the 
measurement equation, Eq. the below Expectation (E) and 
Maximization (M) steps are developed at the k + 1-th iteration for 
i& {1,2,. ..,!<}. 

E Step: Calculating the conditional mean x*^ — ¥j{y^i\y,0^}. 
Considering the fact that x and y are jointly Gaussian we get 

K 

= s,{et) + ft(y -Y^siiOD). (14) 

(=1 

M Step: Finding O'l^^ such that minimizes the cost function 
4)r{di) = ||[xf - Si(fli)](ftn)~3|p with respect to 0,. This is 
also a non-linear optimization problem where the LM technique 
can be applied. The result is given by: 

= et - {\7e,vlM^-) + AH,)- Ve,0.(«Ole^ (15) 

where Hi = diag(V9,. Ve^'/'^^O)- 

We repeat the above two steps starting from iteration fc = 1 
until convergence or an upper limit which has been reached. Since 
at each iteration, i goes from 1 to K, the solutions of each source 
will be updated. In Appendix |Al[ we derive the EM algorithm for 
the problem in details and the results given above. 

3.3 The SAGE calibration algorithm 

SAGE algorithm (Fessler & H ero|1994l l performs better than the 
EM algorithm and has a higher speed of convergence and solution 
accuracy. The major difference between these two approaches is in 
the way of assigning the noise to the complete data space. 

Similar to applying the classical EM algorithm, the first step 
in the SAGE algorithm is to find a complete data space relating the 
observations to the unknown parameters. For this purpose, consider 
the set of all indices related to all the K sources as 

P = {1,2,...,K}. (16) 

Then, define index sets Wi such that 

/ m C P, (17) 



where for all a,b £ Wi, we have a ^ h, and the sources cor- 
responding to the indexes a and 6, source a and source h, have a 
small angular distance (they are near to each other in the sky) and 
subsequently they share some elements of the parameter vector d. 
We have 

WiHWj =0, fori/ j. (18) 
Let us assume that we have m such index sets. Thus, 

m 

m^K, P=[jW,. (19) 

i=l 

For each i £ {1, 2, . . . , m}, we define a new parameter vector 
consisting of all the elements in the parameter vector 6, which 
are affected by the sources with indexes in Wi. In other words, 
we provide the possibility to have elements in these new parameter 
vectors which are shared by more than one source. 

Now, we make a new partitioning over the parameter vector 6 

as 

fl = [fl^, e^, ...eJvj^. (20) 

Similar to |Fessler & Hero| ( |1994^ , we define the hidden data space 
as 

= ^ siXOw,) + n, (21) 

selecting the index set Wi £ {Wi, W2, . . ■ , Wm} which prefer- 
ably consists of the indices of the brightest sources. Note that in 
Eq. j2T} all the noise has been associated to the sources with in- 
dices in Wi. This is the main difference between the SAGE and the 
classical EM algorithm. Using Eq. l |21^ , the measurement equation 
can be written as 

m 

y = xw, + ^ ^ s, (Ow, ) . (22) 

By applying the EM algorithm to this new form of the measurement 
equation, we arrive at the following steps for the k + 1-th iteration 
of the SAGE approach: 

SAGE E Step: Computing the conditional mean x^ — 
E{'x.Wi\y,0''}- Since xvi/. and y are also jointly Gaussian, we get 

m 

^w, = ^ siietv.) + {y ^ si(0w^)) 

leWi j=i leWj 

m 

SAGE M Step: Finding fl^^ which is minimizing the cost 

function 4>w,{0w,) = il[x^^ - J2iew,^ii^Wi)](n)-i \\^ with 
respect to Owi- The result is similar to Eq. \15\ . As before, we 
iterate from fc = 1 to an upper limit. At each iteration, we change 
the index set Wi within {Wi, W2, ■ . ■ , W^} to update all or some 
sources. 

A special case of the SAGE algorithm was presented by 
[Yatawatta et al.| ( |2009| l if we consider Wi = {i} for all i £ 
{1,2,..., K}. In fact, we apply the same partitioning over the un- 
known parameter which is used for the classical EM algorithm, 
6 = [Ol 02... OrY" ■ By choosing the index i, where the source i 
is preferably the brightest source in the sky, the hidden data space 
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will be defined by 

Xi = 8,(^0 + n. (24) 
Eq. \24) gives us the definition of the observed data as 



y = Xi + ^ Si {6i 



(25) 



1=1 



and subsequently, applying the EM on the measurement equation, 
the k + 1-th iteration of the SAGE technique will be as below: 

SAGE E Step: Conditional mean xf — E{xi|y,S'^} is derived 

from 

K K 

5? = s,{et) + (y - ^ s,(0f )) = y -Y^siie'i). (26) 

1=1 1=1 

SAGE M Step: O'l^^ is given by minimizing the cost function 

0.(00 = ll[S?-s,(«0](n)-5|l^ 

In Appendix |A2[ we present the complete calculation process of 
applying the SAGE algorithm to the calibration problem. 



3.4 Computational Cost 

At each iteration of the LS calibration scheme via the LM algo- 
rithm, the non-linear system presented by Eq. (jTJ should be solved 
which is of order (KN). Therefore, ignoring the cost of calcula- 
tions for the inverse part in this equation, the computational cost 
will be 0{{K N)^). Furthermore, for radio synthesis arrays such as 
LOFAR and SKA, computing the matrix inverse in Eq. l|7j is very 
costly since the number of measured data is becoming very large. 
While, at each iteration of the EM algorithm, we should solve Eq. 
l |15^ K times and subsequently the computational cost of the EM 
calibration algorithm will be KO{N'^), which is still much cheaper 
compared with the LS calibration approach. Thus, the EM as well 
as the SAGE calibration techniques are superior to the LS one in 
terms of computational cost. 

Note that the LM optimization technique is employed for all the LS, 
EM, and the SAGE calibration algorithms. Thus, its corresponding 
inversion computation is shared in all the methods. However, in the 
EM and SAGE algorithms the size of the matrix that is inverted is 
smaller compared to that of the LS algorithm because of the parti- 
tioning procedure of the parameters. Given the fact that the compu- 
tational complexity of the matrix inversion scales a number of its 
elements to the third power, it is evident that inverting few smaller 
matrices as in the EM and SAGE cases is faster than inverting a 
single large matrix, which is the case for the LS algorithm. 



4 NOISE IN SOLUTIONS 

To compare the accuracy of the SAGE calibration scheme to the LS 
one, we statistically analyze the solver noise. The lower the solver 
noise in a calibration method, the smaller the errors in calibrated so- 
lutions provided by the calibration algorithm itself. Consequently, 
the accuracy of the method is higher. 

In order to do a proper statistical analysis of the calibration 
algorithm's solver noise, we make the assumption that their solu- 
tions are linear combinations of a deterministic trend and noise. 
This noise can have many origins. In the ideal case, it is introduced 
by the primary noise sources (thermal noise at the receiver, the sky 



noise, radio interference, etc.) and by variations of the instrumen- 
tal and propagation properties. However, in reality the solver noise, 
which we are most concerned about and is introduced by the cali- 
bration method itself, is also added to thoese sources. 

The goal in this section is to quantify the level of the solver 
noise for the different calibration algorithms, based on the evalua- 
tion of the statistical interaction between their solutions. 



4.1 Statistical similarity 

In order to compare the level of solver noise for the different cal- 
ibration methods, we assume that the true values of the solutions 
from different directions at the same antenna are statistically uncor- 
related. Therefore, any correlation between the calibrated solutions 
for different directions is caused by the corresponding calibration 
technique itself. In reality, there are also correlations that originate 
from the system noise in the solutions, but this can be ignored when 
we compare the solutions of a fixed measured data obtained by dif- 
ferent calibration methods. Therefore, a high solver noise in a cali- 
bration scheme causes strongly correlated solutions for any number 
of directions at one given antenna (or maybe even more). To detect 
the statistical similarity between the gain solutions we proceed as 
follows: 

Once we estimate the parameter vector 0, we obtain the Jones 
matrices Jqs{0) for different antennas q € {1, 2, . . . , A^} and dif- 
ferent directions s £ {1, 2, . . . , K}. Let us consider the matrices 
to be diagonal as 



Jll,q 








(27) 



where Jii,q and J22,q are complex values. We treat each gain so- 
lution from the direction s of antenna q as a random vector 0qa 
defined by 

Oqs = [lRe(Jii,,) 3m(Jii,q) %t{J22,q) 3m{J22,q)]T ■ (28) 

Now, we can investigate the statistical similarity between the 
gain solutions utilizing Kullback-Leibler Divergence (KLD) and 
Likelihood-Ratio Test (LRT). In general, both KLD and LRT com- 
pare the efficiency of fitting two different statistical models to a 
fixed set of measurements. Utilizing these methods on the random 
vectors defined by Eq. l |28[ ), we obtain their statistical similarity 
in two different interpretations. The higher these similarities, the 
higher the interaction between the solutions, as well as the solver 
noise. 



4.2 Kullbaclt-Leibler Divergence (KLD) 

An efficient way to quantify the statistical similarity between the 
solutions is to use KLD. 

The relative entropy, defined as the KLD, for each couple of 
Probability Density Functions (PDFs) / and g of solutions Oqk and 
Oqi, respectively, is defined as 

fiOqk) 



KLD(/,g) = ^/(0,fc)log- 



(29) 



where the solutions are coiTesponding to the given antenna q at the 
two different directions k and I. 

The KLD is a measure of information "divergence" between two 
different PDFs for the same random variable. Larger values of 
KLD(/, g) are interpreted as less interaction between the solu- 
tions, and subsequently, as less solver noise. We use a Monte-Carlo 
method to evaluate Eq. l|29[). 
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4.2.1 Density estimation 

To calculate the value of KLD we need to estimate the PDFs of the 
solutions. 

We define the PDF of the random vector Oqs, f{0qs;fi), as a 
mixture of L isotropic (scalar variance) Gaussian PDFs. The as- 
sumption of mixture modeling is based on the fact that the solu- 
tions are affected by parameters that belong to different underly- 
ing statistical populations and due to the Central Limit Theorem it 
is reasonable to assume that the distributions of those populations 
converge to Gaussian distributions.. Therefore, it can be written as 



(30) 



where g{0qs;mi, ai) is the PDF of a four dimensional Gaussian 
distribution with mean and variance erf I given by 



g{egs;mi,ai) = 



— ^= exp. 



(31) 



and yS is the vector of the mixture model unknown parameters 



P = [pi,ml ,ai,...,pL,ml,crLf , (32) 



which are estimated by the EM algorithm ( |Bilmes|1998^ . 



4.2.2 Akaike 's Information Criterion for model order selection 

To find the optimum number of L (the order of Gaussian mixture 
model in Eq. l |30[ l) we use Aikake's Information Criterion (AIC, 
|Akaike|1973T >. 

According to the definition of AIC, we select a L such that it 
gives us the minimum value of A\C{L) which is defined as 



AIC(L) = -2L(;3) + 2k, 



(33) 



where L(.) is the log-likelihood function of 0, is the maximum 
likelihood estimate of j8, and k is the number of parameters in the 
model presented by Eq. JSOl. 



4.3 Likelihood-Ratio Test (LRT) 

Another standard approach to investigate the statistical interaction 
between the solutions is the Likelihood-Ratio Test (LRT). Using 
this test, we can compare two models, which both can be fitted to 
our solutions. 

Let us define for each antenna q, where g G {1,2,..., A''}, and each 
pair of directions like k and /, where k, I £ {1, 2, K}, a new 
random vector Zqki as 



Iqkl 



(34) 



In fact, we are concatenating the solutions of the same antenna for 
two different directions together. Assume that Xqki is following a 
multivariate Gaussian distribution with mean 



and variance 



So 



s\eqk) 



\0q 



(35) 



(36) 



where m and are the sample mean and sample variance of the 
solutions respectively. The structure of the variance matrix Eo tells 
us that the statistical correlation between the components of the 
random vector Zqku or between the solutions Oqk and Oqi, is zero. 
This is exactly the desirable case in which the solver noise vanishes. 
Thus, we can consider this model as our null Ho model defined by 



Ho 



tqkl 



'M{m, Eo). 



(37) 



To investigate the validity of the null model compared with the case 
in which there exist some correlation between the solutions due to 
the presence of solver noise, we define the alternative H\ model as 



Hi : Zqki ~ A/'(in, El), 
where the variance matrix Ei is given by 

S^iOqk) Q{Oqk,0gl) 



El = 



Q{0qk,0ql) 



\0q,) 



(38) 



(39) 



and the 4x4 matrix Q{Oqk , Oqi ) denotes the sample covariance of 
the solutions. 

Using the above models, the Likelihood-Ratio is defined as 



A = -21n 



/ Likelihood for null model 



iLi 



Likelihood for alternative model / ' 



(40) 



which has a x distribution with 16 degrees of freedom. As A 
becomes smaller, the null model, in which the statistical correlation 
as well as the statistical similarity between the solutions is zero, 
becomes more acceptable compared to the alternative one. There- 
fore, the smallest the A, the less the solver noise and vice-versa. 

Note that the test result is reliable only when a large number 
of sample solutions is in hand. In this case, because of the Central 
Limit Theorem, the distribution of solutions tends to be a multivari- 
ate LS distribution, which is assumed initially by the test. 



5 ILLUSTRATIVE EXAMPLES 
5.1 Simulated observation 

First we use a simulated observation to compare the efficiency of 
the SAGE calibration algorithm with the LS algorithm. Utilizing 
simulations instead of real observations has the advantage of having 
the true underlying gains available, which is a luxury not available 
with real data. Therefore, assessing the convergence of the calibra- 
tion techniques as well as comparing the accuracy between differ- 
ent calibrated solutions is much more objective. 

We consider a linear, East- West radio synthesis array that has 
14 dipoles with dual polarization. We put three bright sources in 
our sky model named A, B, and C with intensities 2950, 2900, and 
2700 Jy, and three other weak sources named D, E, and F with in- 
tensities 4, 3.5, and 3 Jy, respectively. The simulated single channel 
image at 355 MHz is shown in Fig.[T] As we can see in Fig.[T| the 
weak sources are not visible in the image. We also consider that 
there is no beam pattern, therefore the whole sky is being observed 
with uniform sensitivity. 

As it is shown in Eq. (|2]), in the measured visibilities, the coherency 
of the sources are multiplied by the Jones matrices (gain errors). 
We consider the matrices to be diagonal. It means that the signal 
received at each dipole is not affected by the other one which is 
an ideal case. We produce gain errors in the norm and phase of 
the Jones matrices diagonal terms which are I and initially. We 
generate the norm and the phase of the gains as multiplications of 
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Figure 1. Single channel simulated observation of three bright sources, A, B, and C, and three weak sources, D, E, and F. The intensity of the bright sources 
are 2950, 2900, and 2700 Jy and of the weak sources are 4, 3.5, and 3 Jy, respectively. The image size is 8 by 8 degrees at 355 MHz. There are no gain errors 
and noise in the simulation. 




Figure 2. Simulated observation with added gain errors. The errors are complex numbers having norms and phases as multiplications of random numbers 
with various linear combinations of sin and cos functions. The gradients of the errors increase as a function of time and the phases are in different negative 
and positive slopes. 
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Figure 4. Residual image of the LS calibration using nine iterations. The calibration is processed only for the strong sources A, B, and C, that are completely 
removed in the residuals. The two weak sources D, and E appear in the image, but the weakest source F is hardly visible. 
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Figure 5. Comparison between the performance of the SAGE and the LS 
calibrations (nine number of iterations) in order of the accuracy of sohi- 
tions and the speed of convergence. Different points correspond to different 
snapshots made from the simulation. The speed of the SAGE calibration 
technique is higher than the LS calibration, while the norm of their residual 
errors are almost the same 
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Figure 6. Averaged Likelihood-Ratio of direction dependent gains (True 
values and calibrated ones) for all two source combinations between the 
bright sources A, B, and C 



random numbers with different linear combinations of ain and cos 
functions, whose gradients increase with time. We also add another 
random term increasing as a function of time, just to the phase er- 
rors, to provide the phases with positive and negative slopes. The 
simulated result is presented by Fig. [2] 

Finally, we apply the SAGE and the LS calibration to solve only 
for the gains of the visible strong sources A, B, and C. The residu- 
als of the SAGE and LS calibration using nine iterations are shown 
in Fig.|3]and Fig.|4] respectively. As we can see in Fig.|3]and Fig. 
|4] The strong sources are completely removed and the three weak 
sources are visible in the residuals of the both algorithms which 
mean both are converging to real solutions. But, the weak sources 
intensities in the residuals of the SAGE calibration are closer to the 
absolute intensities in our sky model which shows the superiority of 
the SAGE calibration in terms of accuracy. This fact is shown more 
clearly by table 1 5.1 [ where the real intensities of the weak sources 
are compared with the calibrated ones. 

Table 5.1 

Source DEE 
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Figure 7. Averaged KLD of direction dependent gains (True values and 
calibrated ones )for all two source combinations between the bright sources 
A, B, and C 



Real intensity(Jy) 4 3.5 3 

SAGE calibration 3.8399 3.6695 2.3085 
LS calibration 2.8327 2.6329 1.6081 



Note that as this is a single channel simulated observa- 
tion without any additive noise, the difference between the two 
method's residuals is slight. However, the importance of applying 
the SAGE instead of LS calibration is evident since the computa- 
tional cost of the SAGE is much smaller, as it is discussed in section 
|3.4| Fig.|5]shows the performance of the algorithms in terms of ac- 
curacy and speed of convergence. As we can see in Figure [5] the 
SAGE algorithm's speed of convergence is much higher than that 
of the standard LS calibration. 

In order to investigate the efficiency of the LRT and KLD ap- 
proaches in revealing the statistical similarity of gain solutions, we 
applied both methods, using Eq.|29]and Eq.|40] to the SAGE and the 
LS calibration solutions and compared their results with the LRT 
and KLD of the true Jones parameters (simulated gains). The com- 
parison is shown by Fig.|6]and Fig.|7] Fig.|6]exhibits an outstanding 



performance of the LRT approach in which the direction dependent 
gain's lowest statistical similarity belongs to the real Jones values, 
that is almost zero. For SAGE calibration's solutions this similar- 
ity becomes higher, and in LS results it reaches to its highest level. 
This result demonstrates the superior accuracy of the SAGE cali- 
bration's solutions regardless of the residual images. But, the KLD 
results in Fig. |7] show the same level of statistical correlation in 
the calibrations' solutions and in the True Jones parameters. This is 
due to the fact that these results are calculated by Gaussian mixture 
models, which are fitted to the gains. This characteristic of the KLD 
approach (using fitted PDFs for gains distributions rather than the 
true PDFs) decreases the methods sensitivity in revealing the level 
of statistical similarity between different directions' gains, espe- 
cially in our case where the simulated gains' correlations are low . 
However, we anticipate better performance of KLD method in real 
observations in which the higher gain errors plus the additive noise 
cause a higher solver noise. 
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Figure 8. Residual images around CasA (top row) and CygA (bottom row) obtained by the LS calibration (left column) and the SAGE calibration (right 
column) using twelve iterations. CasA is over-subtracted in the residual of the LS calibration result, due to the inaccuracy in the estimation of the relevant 
direction-dependent gain, while in the SAGE calibration result it is removed almost perfectly. On the other hand. Both the SAGE and the LS calibrations 
present some problems around CygA since at some point of the integration time it goes very close to the horizon. Even in that case, the subtraction residual 
for the SAGE algorithm is 10 percent lower than for the LS method. 



5.2 Real observations 

To illustrate the applicability of the KLD and LRT approaches in 
detecting solver noise in real observations calibrated solutions, we 
use the data from the example in'Yatawatta et al.|( |2009[ l. [Yatawatta| 
|et al.^ ( 2009 1 presents the calibration results using real data obtained 
during a 24 hours long LOFAR test core station (CSl) observa- 
tion, via SAGE and LS calibration techniques. The one channel 
images around 3C 461 (Cassiopeia A, CasA) and 3C 401 (Cygnus 
A, CygA) at 50 MHz after applying these calibration methods us- 
ing twelve iterations are shown in Fig. [8] The result at Fig.[8]clearly 
verifies the superiority of the SAGE calibration scheme as it was 
mentioned in ^Yatawatta et al.| ( |2009| l as well. We calculate the KLD 
using Eq. |29]as well as Likelihood-Ratio using Eq.|40]for the cal- 
ibrated solutions of these two sources obtained by the mentioned 



calibration techniques. In the KLD approach, we fit a Gaussian 
mixture model, which its components have full rank covariance ma- 
trices, to the solutions. The KLD and LRT results are shown in Fig 
|9]and Fig[TO] respectively. As we can see in Fig|9] the KLD of the 
solutions derived by the LS calibration is always lower than that of 
the SAGE calibration. It is also shown in Fig [To] that the LRT of 
the LS calibration's solutions is always higher than the SAGE cal- 
ibration's. Therefore, the solver noise in the LS calibration results 
is measurably higher than in the solutions obtained by the SAGE 
calibration. This means that the accuracy of the SAGE calibration 
is always higher than that of the LS calibration, which is visible 
in Fig. [8] as well as the aforementioned images in jYatawatta et al.] 
I ,2009| l. 
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Figure 9. KLD of the gain solutions for CasA and CygA 
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Figure 10. Likelihood-Ratio of the gain solutions for CasA and CygA 
6 SUMMARY 

Since the new generation of radio synthesis arrays are producing a 
large amount of data with high sensitivity, it is of great interest to 
devise new calibration techniques in order to increase the accuracy 
of solutions with the highest possible speed of convergence. 

In this paper, we presented the superior performance of the 
SAGE calibration scheme compared with the traditional LS cali- 
bration method. The superiority is in the sense that SAGE calibra- 
tion has the highest accuracy, the fastest speed of convergence, and 
the cheapest computational cost. Since both the algorithms are es- 
timating the ML of unknowns in different ways, it is possible that 
in some special cases ,such as having a very low initial noise in 
the measured visibilities, we don't have a specific difference be- 
tween the accuracy of their solutions. While, even in this case, 
the SAGE calibration's faster speed of convergence and cheaper 
computational cost justify its application instead of the LS calibra- 
tion. We compared the accuracy and the rate of convergence of the 
SAGE and the LS calibration in a simulated observation example. 
More accurate results in a much shorter time are obtained by the 
SAGE algorithm compared with the LS. The challenge in improv- 
ing the performance of the SAGE calibration technique is to find 
the best way of partitioning over the unknown parameter space. 
This can highly affect both the accuracy and speed of convergence 
of the calibration process. 



On the other hand, there always exists some estimation errors 
in the calibrated solutions. These errors are originated from the sys- 
tem noise (sky and instrumental) in the measurements, plus "solver 
noise" which is referred to errors produced by the calibration algo- 
rithm itself. The more accurate the calibrated solutions are, the less 
the amount of solver noise is. Based on this fact, the best calibra- 
tion method is the one which provides us with the minimum solver 
noise. KLD and LRT are utilized to reveal the level of solver noise 
in the solutions of different calibration schemes. We showed in il- 
lustrative examples that the LRT algorithm produces a very promis- 
ing result. The KLD method is rather inconclusive according to the 
initial assumption for the PDFs fitted to the solutions. We assumed 
that the distribution of the solutions is a mixture of Gaussian dis- 
tributions. However, in reality, the solutions may follow a differ- 
ent distribution and subsequently the KLD result may not have the 
same efficiency as the LRT's. Therefore, initially we should find the 
proper distribution which is appropriate for the solutions in order 
to calculate the KLD. 

The main direction of future work should be to investigate 
the application of the proposed calibration technique to real data 
obtained by LOFAR. Since LOFAR is observing the whole sky, 
the number of radio sources in the sky model will be very large. 
Subsequently, for applying the SAGE calibration, partitioning over 
the unknowns by manually checking the characteristics of all the 
sources will not be efficient. Therefore, the first challenge in utiliz- 
ing the SAGE calibration to real data would be automating the par- 
titioning over the unknowns for any given problem. Furthermore, 
we saw in the paper that all the mentioned calibration algorithms in- 
volve essentially the solution of a non-linear optimization problem. 
Applying suitable regularization techniques using proper smooth- 
ing functions to improve the accuracy of the solutions is an issue 
that must be investigated further in the near future. Add to that, 
that all the calibration schemes have the possibility of converging 
to a local optimum. Utilizing probabilistic techniques such as Sim- 
ulated Annealing (SA) ( [Kirkpatrick e t al. 1983) to assure that we 
are converging to a global optimum has the problem of decreasing 
the speed of convergence. Providing extra constraints for the men- 
tioned calibration schemes which can guarantee the convergence to 
the real solutions could be one of the challenging areas of research 
in the future. Moreover, we have shown that the solver noise cri- 
terion could be used for revealing the level of accuracy in the cal- 
ibrated solutions. Investigating possible systematic effects on the 
solver noise as well as the level of their influence are amongst the 
main issues for the future work. 
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Moreover, the complete data x and the observed data y are related 
by the below linear transformation 



y ■ 



[I I . . . I]x = Gx, 



(A6) 



where G is a block matrix consisting of the identity matrix I for K 
times. Thus, they are jointly Gaussian and for a parameter value 0' 
we have 

X = E{x|y;6l'} = s(6l')+EG'^[GEG^]"'[y-Gs(e')]- (A7) 
ijATj gives us 



X, = s.(e:)+/34y-^s,(e0 



(A8) 



as the i-th element of the vector x. 

In applying the EM algorithm, at fc + 1-th iteration we 
would like to find the parameter vector 6''^^ such that maximizes 
E{log/x(x;^)|F = y;^''}, where 0*' is the estimation of ob- 
tained at fc-th iteration. According to ijASj it is exactly equivalent 
to find O'l^^ for each i £ {1,2,...,_R'} such that minimizes 
E{(x, - s,(fli))«(ftn)-i(xi - si{6,))\Y = y-e"). Therefore, 
at the fc + 1-th iteration of the algorithm we have 



E Step: Calculate 

K 

xf = s,(0,^) + /3,[y-^s,(ef)]. 

M Step: Compute 

ef+i = argmin||[x,*-s,(eO](ftn)-5l| 

fori e {1,2,...,^!'}. 



(A9) 



(AlO) 



APPENDIX A: THE EM AND THE SAGE ALGORITHMS 
Al EM algorithm 

Considering the complete data x — x|' . . . x^J-]"^, where x^s 
are defined as jSj, its PDF will be equal to 

/x(x;e) 



In | |A1[ ( we have 



^ exp{-(x-s(6>))"E-'(x-s(e))}. (Al) 



and 



E = 



7?in o 
o 



o 
o 



(A2) 



(A3) 



O O ... /SkH 
Therefore, the log-likelihood of the complete data x is derived from 
log/x(x;0) =c-{(x-s(0))^E-i(x-s(e))}, (A4) 
where 

c=-log{^(™)lSl}. 
Substituting \A2) and ( |A3[ l in we can rewrite \AA) as 
log/x(x;0) = c 

(A5) 



A2 SAGE algorithm 

Since in the SAGE algorithm the complete data "X-Wi is defined as 
\2i\ , we have 

log/x^^(xw.;flwJ = -log{^*^|n|} 

(All) 

At fc + 1-th iteration of the algorithm we should cal- 
culate the parameter vector fl^^ which is maximizing 



E{log/x„,^(xw.;(9wJ|y = y;***}. Having (|aiT|, 6»^i can be 
derived by minimizing E{(xw; — EievK Si(^Wi))^n^^(xw. — 
^Yl,i<^w- ^'(^w'i))!^ ~ yi^'°} with respect to Qwi- Therefore, the 
SAGE algorithm's steps at the fc + 1-th iteration would be written as 



SAGE E Step: Calculate 



= E{XM/Jy,e'^} = y - ^ ^ s,(<.), (A12) 



which is derived from d23ll. 



SAGE M Step: Compute 



fvK, = arg mm | 



[xw-. 



E «'(«vy.)](n) 



(A 13) 
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fori G {1,2,..., A-}. 

The EM and in particular the SAGE algorithms have the well- 
known advantage that they ensure that the Ukelihood gets increased 
at each iteration step. However, they suffer from two negative fea- 
tures. Their rate of convergence is exponential, and they cannot 
guarantee the convergence to a global optimum. Nevertheless, they 
are significantly benefited from choosing a suitable starting point 
which can be the topic of the future work. 

This paper has been typeset from a TgXy KT^ file prepared by the 
author. 



